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We investigate the tendency for financial instruments to form clusters when there are multiple 
factors influencing the correlation structure. Specifically, we consider a stock portfolio which con¬ 
tains companies from different industrial sectors, located in several different countries. Both sector 
membership and geography combine to create a complex clustering structure where companies seem 
to first be divided based on sector, with geographical subclusters emerging within each industrial 
sector. We argue that standard techniques for detecting overlapping clusters and communities are 
not able to capture this type of structure, and show how robust regression techniques can instead be 
used to remove the influence of both sector and geography from the correlation matrix separately. 
Our analysis reveals that prior to the 2008 financial crisis, companies did not tend to form clusters 
based on geography. This changed immediately following the crisis, with geography becoming a 
more important determinant of clustering. 


I. INTRODUCTION 

Financial markets provide a rich application area for 
the study of complex networks [?????]. In a 
financial network, nodes represent financial instruments 
while the edge weights denote the correlation between 
the price changes of these instruments over some period 
of time. The purpose of studying financial networks is to 
gain information about the underlying correlation struc¬ 
ture and to better understand how it is influenced by 
various real-world factors, which is of great importance 
to risk management [? ], investment theory [? ], and 
index tracking [? ]. 

Much work has focused on the task of detecting clusters 
in financial networks, corresponding to groups of compa¬ 
nies which have highly correlated stock price movements. 
It has been found that companies often form clusters 
based on the industrial sector to which they belong, with 
(for example) banks tending to behave differently than 
telecommunications companies [????]. These studies 
generally only focus on the analysis of companies located 
within a single country, however a similar analysis has 
been carried out currency exchange rates and national 
stock indexes [? ? ? ] which shows that there is also ev¬ 
idence of clustering based on geographical location, with 
East Asian currencies tending to behave differently than 
European ones, and so on. 

The existing literature on financial correlation net¬ 
works has mainly focused on situations where there is 
only a single factor which influences the correlation struc¬ 
ture - membership when studying individual companies, 
and geographical location when studying currency ex¬ 
change rates or national stock indexes. However, many 
real portfolios will be driven by multiple factors. For ex¬ 
ample, a portfolio containing companies which are spread 
across multiple sectors and located in several different 
countries may be expected to form clusters which are 
based on both sector and geography, with Spanish banks 
perhaps behaving in a different manner than British 


banks. An analysis which focused only on one clustering 
factor, such as sector, would be inadequate for this type 
of data. With this in mind, we develop techniques which 
allow multiple factors to drive the correlation structure. 
We show that in a naive analysis, it appears that it is 
mainly sector membership which influences clustering. 
However a closer look reveals that there is a semblance 
of geographical clustering, although this is swamped by 
sector effects. In order to reveal the true influence of ge¬ 
ography, we can make a transformation which strips the 
correlation matrix of the influence of sector membership. 

Finally, we investigate how the influence of sector and 
geography on the clustering structure evolves over time, 
with a particularly important change occurring during 
the recent financial crisis where geography starts to be¬ 
come a more important determinant in the countries 
which were most affected by the sovereign debt crisis. 
Understanding the time-dynamics of financial correla¬ 
tions is an important area of research [? ? ? ], and 
our analysis shows how this can sensibly be carried out 
in cases where there are multiple latent clustering factors. 


II. DATA 

The data we will analyse throughout consists of the 
daily returns of the 350 companies which compose the 
Standard & Poor (S&P) 350 European stock index. The 
companies in this index are among the largest in the con¬ 
tinent, and together make up approximately 70% of the 
total European market capitalization. 

We obtained the daily closing prices for each company 
between the dates of January 1 st 2003, and March 17 th 
2012. In order to avoid the difficult problem of synchro¬ 
nizing prices across substantially different time zones, 
we have excluded companies which trade only on non- 
European stock exchanges, along with those which ei¬ 
ther did not have stock quotes going back to 1 January 
2003, or which underwent mergers or acquisitions during 
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Sector 

Number Country 

Number 

Materials 

30 Great Britain 

79 

Consumer Staples 

24 France 

44 

Financials 

57 Germany 

33 

Utilities 

16 Sweden 

24 

Telecommunications Services 

15 Italy 

18 

Industrials 

45 Spain 

16 

Consumer Discretionary 

44 Switzerland 

12 

Information Technology 

10 Netherlands 

8 

Energy 

11 Greece 

5 

Health Care 

16 



TABLE I: Number of companies from each sector and 
country present in the data set 


this period. In total, there are 267 companies remaining 
for which we have data. In order to avoid issues related 
to the effects of different currencies, we converted the 
closing prices of each company into Euros by using the 
relevant historical exchange rate. 

Finally, Standard and Poor provide an official classifi¬ 
cation of these companies into industry sectors based on 
the Global Industry Classification Standard (GICS), with 
each company being assigned to one of 10 different possi¬ 
ble sectors. We will use this in order to check whether the 
clusters found in the data reproduce real patterns.The 
number of companies in the data set belonging to each 
sector and geographical region is shown in Table [TJ where 
we have omitted countries containing less than 5 compa¬ 
nies since such small samples may not allow for accurate 
analysis. It is important to note that the GICS sector 
classifications are very broad and many subdivisions ex¬ 
ist within each one; for example, the group labelled as 
Financials includes both banks and insurance companies 
which may behave in different ways, while the oil com¬ 
panies within the Energy sector may behave slightly dif¬ 
ferent from other non-oil energy companies. This point 
will be important when it comes to assessing the quality 


of clustering structures in Sec. V A 


III. CLUSTERING ALGORITHM 


Clustering is the process of assigning objects to groups 
such that objects in the same group are similar accord¬ 
ing to some specified distance metric, while companies 
in distinct groups are different. As in [? ? ]., we put a 
distance metric on the space of companies which is based 
on the correlation between their daily stock log returns. 
Let P- h t denote the stock price in Euros of company i 
on day t, and let = logPi jt — logP iit _i denote the 
logarithm of the daily return. The Pearson correlation 
between companies i and j based over the whole period 
off T equally weighted days is then: 
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The distance between any two companies i and j can 
then be defined as: 

Wij = ^ 2(1 - Pij ). ( 2 ) 

where it can be seen that Wij £ [0,1], with Wij = 0 if 
the companies are perfectly correlated. 

Although there are many possible algorithms for clus¬ 
tering, we choose to use average link agglomerative hi- 
erarchal clustering (HC) [? ]. This is because as noted 
above the GICS sector labels are broad, and there is rea¬ 
son to believe that there may be a hierarchy of sub-sectors 
under each broad label. A full description of the HC al¬ 
gorithm can be found in the above reference, we will only 
summarise it here. Given the distance matrix {W^}, HC 
performs clustering by recursively merging the companies 
into clusters based on greedy optimisation. Initially each 
of the n companies is assigned to its own cluster, result¬ 
ing in the n clusters {ci,..., c n } . Then, HC selects the 
pair of clusters which are closest together, and succes¬ 
sively merges them until only one cluster remains. The 
output of the HC algorithm is a dendrogram. This 
is a tree structure which shows the order in which the 
clusters were merged. Each company is represented as 
a leaf of the tree. When two clusters are joined by HC, 
a link between them is added. The dendrogram allows 
the hierarchal clustering structure of the data to be vi¬ 
sualized, since the relationship between companies and 
clusters and subclusters can be easily viewed. 


IV. ‘HIDDEN’ CLUSTERS 

Before proceeding with the analysis of the data, we 
first describe the problems which can arise when multiple 
factors are influencing the clustering structure. In a typi¬ 
cal stock portfolio, each company has an industrial sector 
membership and also a geographical location. In times of 
normal market behavior, we will later show that the clus¬ 
tering structure of the network is largely attributable to 
sector membership. However, there is a noticeable sub¬ 
clustering where for example banks located in the United 
Kingdom behave slightly different than banks located in 
Germany. This suggests that the structure of the corre¬ 
lation matrix is formed by the interaction of two distinct 
dimensions, one corresponding to sector membership and 
the other to geography. Because sector membership is 
usually the dominant factor influencing the correlation 
matrix, the effects of geography are somewhat masked in 
the raw data. 

This problem differs markedly from the usual sce¬ 
nario considered in the literature on detecting overlap¬ 
ping communities [? ] which assumes that, even though a 
node may belong to multiple communities, these different 
communities are distinct and well formed. An example 
of this situation is shown in Fig. [la] which shows a rep¬ 
resentation of two overlapping clusters colored red and 
green, plotted in a 2 dimensional space. A small number 
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FIG. 1: (a) shows a typical scenario where two clusters 
overlap, due to a number of points (colored in blue) 
belonging to both, (b) shows the ’hidden’ cluster 
situation where the effect of the feature which groups 
points as either circles or triangles is masked by the 
dominant feature based on color. 


of points belong to both clusters, and are plotted in blue 


In contrast, our situation is akin to that shown in Fig. lb 


Here each point has two binary features attached to it, 
say A and B. If feature A has a value of 1 then the point 
is colored red, and if it has a value 0 then it is colored 
blue. Similarly if feature B has a value of 1 then the point 
is represented by a circle, otherwise it is represented by a 
triangle. From the plot, it is clear that feature A has the 
largest effect on the clustering structure, with points be¬ 
ing very well separated based on which value they have. 
However feature B also has some weight, with points be¬ 
ing sub-grouped within each cluster based on this. If we 
were to naively run a clustering algorithm on this data, 
then it would not be able to detect the effect that feature 
B has on the clustering because this is masked due to the 
stronger effect of A. In order to reveal the true effect that 
B is having, we must first transform the data to remove 
the effect of feature A in a principled manner. 

A reasonable question which might be asked is: if the 
clustering structure implied by secondary factors such 
as geography are masked in the financial data to the 
extent that they are not immediately detectable, then 
why are they of interest? The simple answer is that the 
correlation structure of real data is usually not static, 
but instead changes over time. Although at one point in 
time the clustering based on geographical location may 
be masked, it may become more dominant in the future 
as we will show with reference to the 2008 financial crisis. 


V. STATIC ANALYSIS 

We begin the data analysis by working with the static 
correlation matrix defined over the whole 8 years of data, 
ignoring any time varying structure. In order to aid in¬ 
terpretation of the clusters found by the HC algorithm, 
it is useful to project the correlation matrix into a two- 
dimensional space in order to better visualize its struc¬ 
ture. Although the actual clustering is performed in 


the original high-dimensional space rather than in two- 
dimensions, such a representation can provide an inter¬ 
esting visual summary of the data 

Multidimensional scaling (MDS) can be used for this 
purpose. MDS is a traditional tool in statistical analy¬ 
sis which allows high dimensional data to be plotted on 
the Euclidean plane [? ]. Given a set of n objects and 
an arbitrary distance matrix {Wij} where W denotes 
the distance between object i and object j, MDS searches 
for a representation of the objects in two dimensions such 
that the distance between each pair of objects is as close 
as possible to the distance between them in the origi¬ 
nal higher dimensional space. In other words, it seeks a 
collection of n vectors Xi £ M 2 satisfying: 


E( 

*<i 




Fig. [2a] shows a MDS plot of the data where each 
company is represented by a colored point denoting its 
sector, with each sector having a different color. If com¬ 
panies in the same sector do behave in a similar man¬ 
ner then they will be close to each other on this plot, 
and we would see large groups of companies all of which 
have the same color. This is indeed what happens, with 
the companies in the Financials sector (colored green) in 
the bottom right corner being especially well separated 
from the rest. However, there is also substantial overlap 
between sectors; for example, looking at the top of the 
plot shows that while companies in the Industrials sector 
(blue) and Materials sector (cyan) companies do split off 
from the remainder of the companies, they tend to be 
quite similar to each other. This provides some indica¬ 
tion of why previous approaches to clustering based on 
sector such as [? ] have only achieved limited success; 
although there is sectoral grouping in the data, it is not 
strong enough for there to be a formation of completely 
disconnected sectors, and there is substantial overlap. 


Next, Fig. 2b shows the same MDS plot with the com¬ 
panies instead colored based on their geographical loca¬ 
tion rather than their sector membership. It is visually 
obvious that there is less clustering structure than in the 
previous plot, suggesting that geography does not have 
a strong influence on the clustering. However this naive 
analysis is slightly misleading. Consider the Industrial 
companies in the top right which were colored blue in 
the previous picture. Within these Industrials, there is a 
definite subclustering based on geography, which is most 
evident when looking at the group of five companies at 
the top of Fig. |2b| These form a distinct group within 
the Industrial sector and have a different color from the 
rest of that sector, showing that they are in a different 
country from the rest (Germany). A similar picture can 
be found looking at the Financial companies in the bot¬ 
tom right of Fig. |2b[ again within this sector it can be 
seen that those in the same country tend to be closer 
together, suggesting similar behavior. The same pattern 
can be observed throughout the plot, which suggests that 
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FIG. 2: Fig. (a) shows a MDS plot of the data with 
companies coloured based on their industrial sector 
membership. Fig. (b) shows the same, but coloured 
based on geographical location. Figs, (c) and (d) show 
the respective dendograms, with Fig. (c) coloured based 
on sector membership, and Fig. (d) coloured based on 
geographical location 


order to assess this further, we will introduce a more 
quantitative measure to assess the extent to which the 
clustering is based on sector and geography. Then, we 
show how a transformation of the data can remove the 
effects of sector to make the geographical groupings more 
visible. 


A. Purity 

To quantify the extent that the clusters found in the 
data correspond to real sectorial/geographical groupings, 
the dendrogram produced by HC must be compared 
with the information about sectors and geography ob¬ 
tained from SSzP. However this is subtle because as we 
mentioned in Sec. [Il] above, these groupings are rather 
broad and there may be several sub-sectors within each 
broad sector classification. A naive approach for assess¬ 
ing whether the clusters found by HC correspond to real 
sector/geography clusters would be to cut the dendro¬ 
gram using some suitable entropy-based measure to form 
hard clusters, and then compare these to the S&P la¬ 
bels. Let Ci be a cluster found in the data, and let S 3 be 
a collection of companies which share the same sector (or 
geographical) label according to S&.P. Then the overlap 
between Ci and Sj can be defined based on the number of 
companies they have in common .One possible measure 
for this is the Jaccard index [? ]: 


the data has a structure similar to that discussed in the 
previous section where, although sector is the most im¬ 
portant factor in determining the cluster structure of the 
data, geography also plays an important secondary role 
when it comes to subclustering companies within each 
sector.. In other words, the companies seem to have a 
structure reminiscent of Fig. |lb[ 

Finally, we perform hierarchal clustering on the data 
using the algorithm discussed in Sec. |III| Fig. [2c| shows 
the resulting dendrogram with companies colored based 
on their sector membership, while Fig. |2d| shows the 
same dendrogram but with coloring based on geography. 
The clusters found by HC are as expected given the previ¬ 
ous discussion, with the sector dendrogram showing more 
blocks of solid color (and hence more clustering structure) 
than the one which is colored based on geography, The 
Financial companies in the bottom right corner (green 
in color, dark in greyscale) form an especially well de¬ 
fined cluster, but companies within the same sector tend 
to be clustered together throughout the dendrogram. In 
contrast the geography dendrogram is much more frag¬ 
mented; it can be seen that neighboring companies are 
generally connected to those in the same country (as rep¬ 
resented by many groups of 2-3 identical colors), however 
there is no large scale cluster formation due to the fact 
that companies are primarily split based on sector. In 


\Ci n SA 

\CiUSj\- 

However this will not work for our purposes; suppose 
that the clustering algorithm splits the Financial compa¬ 
nies into two subgroups, one containing banks and the 
other containing insurance companies. Both will have a 
low Jaccard index since neither corresponds exactly with 
the S&P list of Financial companies. To avoid this prob¬ 
lem, we instead define a notion of purity which takes the 
hierarchal structure of the data into account, and which 
uses the dendrogram directly. Suppose we wish to quan¬ 
tify the extent to which companies in some sector 5/ are 
clustered together in the HC dendrogram. Let Ci and 
c 3 be two arbitrary companies belonging to this sector. 
Recall from Sec. m that the HC algorithm hierarchally 
merges clusters until there is only one cluster remaining, 
which contains every company in the data set. There will 
therefore be a lowest point on the dendrogram where Ci 
and Cj are merged into the same cluster. Intuitively, this 
is the smallest cluster in the data containing both Ci and 
c 3 . The fraction of points in this cluster which belongs to 
the same sector as Cj and Cj is then computed. The pro¬ 
cedure is then repeated for all pairs of companies in the 
sector, and the average value denotes the sector purity. 
A similar procedure can be used to measure the purity of 
each country. This procedure has previously been used 
by [? ] in a slightly different context. 
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Before Transformation 


After Transformation 

Sector 

Purity Country 

Purity 

Country 

Purity 

Consumer Discretionary 

0.49 (< 0.001) France 

0.22 (0.004) 

France 

0.26 (< 0.001) 

Consumer Staples 

0.53 (< 0.001) Germany 

0.18 (0.005) 

Germany 

0.40 (< 0.001) 

Energy 

0.83 (< 0.001) Great Britain 

0.38 (0.003) 

Great Britain 

0.75 (< 0.001) 

Financials 

0.75 (< 0.001) Greece 

1.00 (< 0.001) 

Greece 

1.00 (< 0.001) 

Health Care 

0.55 (< 0.001) Italy 

0.25 (< 0.001) 

Italy 

0.32 (< 0.001) 

Industrials 

0.33 (< 0.001) Netherlands 

0.10 (0.016) 

Netherlands 

0.16 (0.002) 

Information Technology 

0.93 (< 0.001) Spain 

0.23 (< 0.001) 

Spain 

0.93 (< 0.001) 

Materials 

0.47 (< 0.001) Sweden 

0.41 (< 0.001) 

Sweden 

0.93 (< 0.001) 

Telecommunications Services 0.53 (< 0.001) Switzerland 

0.11 (0.007) 

Switzerland 

0.15 (0.004) 

Utilities 

0.77 (< 0.001) 





TABLE II: The left hand table shows the purity of each sector and country before any adjustment is made to the 
data. The right hand table shows the purities after the adjustment discussed in Sec. [Vl]has been performed to 
remove the effects of sector/geography membership respectively. P-values are given in brackets. 


In Table Ha (on the left), we show the purity scores 
for each of the sectors and countries in the data. It can 
be seen that these scores broadly reflect the qualitative 
features of the data we noted in the previous section, with 
the typical sector-based purity scores being substantially 
higher than those of the geography groupings. 


To aid interpretation of Table |IIa[ it is important to 
assess whether the high purity scores represent real struc¬ 
ture in the data, or whether they are purely a result of 
statistical noise. For this purpose, we use a measure of 
statistical significance similar to that of a permutation 
test [? ]. Suppose that a particular sector or country 
contains M companies. In order to assess whether the 
observed purity score is significant, it should be com¬ 
pared to the distribution of purity scores that a random 
sample of M unrelated companies would have, since the 
latter represents the score that would be expected to arise 
through pure chance. The purity score distribution of a 
random sample can be estimated in the same manner as 
a permutation test [? ], which is commonly used in sta¬ 
tistical applications. First, select a random sample of M 
companies from the data set, where each company has 
equal probability of being included in the sample. Next, 
compute the purity score of this sample; since the com¬ 
panies have been selected randomly, it is unlikely that 
the resulting score will be high. By repeating this pro¬ 
cedure a large number of times using different randomly 
selected samples, the purity score distribution can be ap¬ 
proximated. A p-value can then be computed for each 
of the purity scores in Table |IIa| based on the probabil¬ 
ity of a random sample of the same size having an equal 
or greater purity. As shown in the table, the resulting 
p-values are all very small, which suggests that the pu¬ 
rity scores found for the sectors and countries given in 
Table |IIa| do indeed correspond to genuine structures in 
the data. 


VI. DISCOVERING HIDDEN CLUSTERS 


In order to reveal the influence that geographical loca¬ 
tion has on the clustering structure, we propose adjusting 
each company’s return series to remove the effects of sec¬ 
tor membership. If the effect that sector factors have on 
the returns is eliminated in a principled way, then the 
effect of geography should be clearer and less masked in 
the transformed data. 

In previous studies of financial price movements, it has 
been found that a large percentage of the observed cor¬ 
relation structure can be explained by a single ‘market 
factor’ which represents the market economy as a whole. 
In other words, much of the correlation between individ¬ 
ual companies can be explained by the fact that each 
company is highly correlated with a global market fac¬ 
tor. This phenomena has been found in many different 
financial data sets, often using Random Matrix Theory 
[? ]. Recently [? ] showed that if the correlation ma¬ 
trix is adjusted to remove the effect of the market factor, 
then the structure of the matrix becomes much less noisy 
- after the correlation that each company has with the 
global factor has been removed, the remaining residual 
correlations allows clustering effects to be more clearly 
determined. 

In [? ], several methods for removing the effects of 
the global market factor are assessed. Each of these es¬ 
sentially involves subtracting the daily price returns of 
the market factor from the daily price returns of each in¬ 
dividual company. The method which they found most 
successfully decouples the company correlations from the 
market factor, is as follows. First, the component of the 
returns associated with the market factor is estimated by 
a synthetic ‘pseudo-index’ defined as the average returns 
of all the stocks in the portfolio: 



i =1 


( 3 ) 














6 


The effect that this pseudo-index has on the returns of 
each stock can be removed by using regression. In a 
standard one variable linear regression model, the rela¬ 
tionship between a set of dependent variables {Y }j and 
a set of independent variables {X}i is modeled as 

Yi = a + j3X'i + ej, e ~ F 

where F is some symmetric distribution with mean 0. In 
this model, the expected value of Y t is po + PiX t , while 
each ei represents the residual variation in Y t which is not 
explained by Xj. The values of a and /? can be estimated 
using standard ordinary least squares. 

If the returns of the pseudo-index are regressed on the 
returns of each stock, this leads to the model: 

1~i,t = OLi + fiiRt + , (4) 

where r,.t denotes the return of the i th stock on day t. 
The residuals eqt then represents the components of the 
returns of stock i which are not explained by the market 
factor. It is shown in [? ] that if each r, t is replaced 
with when it comes to computing the correlations in 
Eqn. [I] used for clustering, then the results tend to be 
less noisy since the correlation which each stock has with 
the market factor is removed. Note that it would be also 
possible to perform the same adjustment by regressing 
the returns of each company onto those of a real stock 
index such as the S&P350 itself, so that R t above is 
replaced with the daily returns of this index. However 
we choose to use the pseudo-index from Eqn. [3] instead, 
since [? ] found via experimental analysis that using 
such a pseudo-index is more effective at decorrelating the 
companies than using a real index. 

We propose using a similar approach to remove the 
effects of sector membership. Rather than constructing 
a pseudo-index to represent the market as a whole, we 
instead construct a separate pseudo-index for each sector 
separately, by averaging the returns of only the compa¬ 
nies in that sector. Then, each company is regressed 
against the pseudo-index corresponding to the sector it 
belongs to. In this way, the sector influence on each com¬ 
pany is removed. Suppose that the j th GICS industrial 
sector S J contains the M companies with return vectors 
r’(i),..., 77 m)- The corresponding sector pseudo-index is 
then: 

1 M 

S t = ( 5 ) 

i=1 

Each company then has the component of its returns 
vector associated with its sector membership being re¬ 
moved, by being regressed against the corresponding sec¬ 
tor index: 


r (i),t 


= at. 






( 6 ) 



FIG. 3: MDS projection and dendrogram of the data 
after adjustment to remove the effects of sector 
membership, with companies colored based on their 
geographical location. It can be seen that removing the 
effect of sector membership gives a much clearer picture 
of how geographical location affects clustering. 


company are replaced with the residual e^w, represent¬ 
ing the component of variance which is not associated 
with sector membership. As the effect of sector mem¬ 
bership will then be largely removed from the data, the 
effects of geographical location should more clearly show 
up in the clustering structure. 

However a problem occurs when using the above re¬ 
gression technique. Because we are considering each sec¬ 
tor separately, there will only be a small number of com¬ 
panies which contribute to each sector pseudo-index SY 
As financial price return series tend to be heavy-tailed 
and non-Gaussian [? ], there is a risk that an extreme 
price movement in the returns of one company will dras¬ 
tically skew the results. We therefore instead use tech¬ 
niques from the field of robust regression analysis. First, 
rather than defining the pseudo-index as the mean of the 
ru) : t returns for each stock in the sector, we instead de¬ 
fine it as the median. It is well known that the sample 
median is much less sensitive to outliers than the sample 
mean. We will now write Sj for the resulting median- 
based pseudo-index for sector j on day t. 

Although this leads to more representative pseudo¬ 
indexes, if the regression coefficients cq and pi are es¬ 
timated via ordinary least squares, they will have a high 
variance if the residuals £u).t come from a non-Gaussian 
heavy tailed distribution. We therefore instead use the 
Theil-Sen estimator to estimate each and pi rather 
than the standard least squares estimator. The Theil- 
Sen approach is widely used when performing regression 
in situation where there may be extreme values, since it 
is more robust against outliers than the ordinary least 
squares [? ]. The Theil-Sen estimates can be computed 
as follows: for each company i in sector j, write the data 
as the ordered pairs with t = 1,2, The 

slope estimator pi is then defined as the median of all 
pairwise slopes between these pairs, i.e.: 


Then, when it comes to computing the correlation ma¬ 
trix from Eqn. [l] used for clustering, the returns of each 


Pi = median 


Ti,m ^i,n 

qj _ qj 


m ^ n, 
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where m and n run over all ordered pairs. Given this 
estimate, the intercept estimator fa is defined as the me¬ 
dian of the r^t— faS° t values. As this estimator is defined 
in terms of medians, it is highly robust to outliers in the 
data and will give a better fit to the data than using the 
standard least squares approach. 

After the price returns of each company have been 
transformed to remove the effects of sector membership 
according to Eqn. [6] with the a,; and fa coefficients es¬ 
timated using the Theil-Sen method, the clustering al¬ 
gorithm was rerun on the data. Fig. [3a] shows a MDS 
plot of the companies colored by country, after the ef¬ 
fect of sector membership has been removed. Compared 
to Figure 2b from Sec. [V]the geographical clusters are 
now much more distinct, and there are several clear clus¬ 
ters. The green (dark in grayscale) cluster in the bot¬ 
tom left, which represents the countries located in the 
United Kingdom, make an especially distinct group, as 
do the brown Swedish companies in the upper right. Fig. 
|3b| shows the dendrogram which is produced by running 
the hierarchal clustering algorithm on this adjusted data; 
as expected, the geographical clusters found are much 
stronger than before, with much thicker bands of color. 

In order to quantitatively assess the impact which the 
removal of sector membership has on the purity of the 
geographical clusters found by HC, Table lib gives the 
new purity associated with each country. These scores 
are now much higher than those in Tablc |IIa| before the 
regression adjustment was made, with almost every coun¬ 
try showing a much higher purity, particularly Finland, 
Germany, the United Kingdom, Portugal, Spain and Swe¬ 
den. This analysis shows that contrary to first impres¬ 
sions, geography plays a strong role in determining the 
clustering structure. It is only the fact that sector mem¬ 
bership has an even stronger influence which masks this. 


VII. DYNAMIC ANALYSIS 

We now investigate how the clustering purity of each 
country changes over time. This is important since there 
is no prior reason to believe that the effect of geography 
will be static, particularly with the recent financial crisis 
and sovereign debt concerns. As discussed in the Intro¬ 
duction, this has potential implications for risk manage¬ 
ment since assessing the risk of a portfolio should take 
into account potential changes in the dynamic correla¬ 
tion structure, and the potential for companies to cluster 
based on different factors at different times. 

In order to study how the clustering structure evolves 
over time, we must allow the correlation pij between com¬ 
panies i and j to be time-varying, rather than fixed over 
the whole period as it was in Eqn. [l] in the static case. 
Given an appropriately defined time-varying correlation 
Pij,t, a dynamic distance measure between the companies 
can be defined in a similar way as in the static case: 

Wij, t = ^2(1 - (7) 


In order for pij t to accurately estimate the true (dy¬ 
namic) correlation at time t, it must be defined in a way 
which gives most weight to the recent stock prices at t 
and less weight to older prices. The usual way to imple¬ 
ment this [? ? ] is via the notion of a sliding window 
where the correlation at time t is computed using only 
the most recent w observations, for some choice of the 
window length w. However we feel that this weighting is 
slightly unrealistic;, with the last w observations receiv¬ 
ing equal weights and then an abrupt drop-off where older 
observations receive no weight at all. Therefore we in¬ 
stead use exponential forgetting which allows the weight 
assigned to older observations to decay more smoothly. 
Let A £ [0,1] denote a forgetting factor, where a value 
closer to 1 results in more weight being given to recent 
observations. The time-varying variance of company i 
can then be defined as: 

a lt = (! - x Wi,t -1 + A U 2 ,*> of.o = 0 

and the time-varying correlation between companies i 
and j can be similarly defined: 

/-i \\ i \ U. tt'j , t. n 

Pij,t (1 A) pij.t "fi A , Pij ,0 0. 

j,t 

This implements exponential smoothing through a 
weighted moving average which assigns higher weights 
to more recent observations, with A controlling the rate 
at which older observations are forgotten. We used the 
value A = 0.01 in the following analysis in order to re¬ 
duce the effect of short-term fluctuation, particularly in 
the countries where data on only a small number of coun¬ 
tries is available. The distance matrix can then be de¬ 
fined at each time step as Wjj.t above, which is used for 
computing the clustering using HC 

Note also that we have marked the returns f with a 
tilde to emphasize that we are using the data which has 
had sector-influence removed via the techniques intro¬ 
duced in the previous section - as will be seen below, 
this makes the dynamic effect of geographical much more 
visible. 

For each of the 2225 days in the 8 year period, we gen¬ 
erated the corresponding Wij t t matrix. The HC cluster¬ 
ing algorithm was run over each matrix, and the purity 
scores of the resulting clusters were computed on each 
day. Figures |4a| and |4b| show how the clustering purity 
for several countries changes over time. These plots show 
that that in the countries most affected by the financial 
crisis, namely Spain, Greece and Italy, there is a very no¬ 
ticeable increase in the clustering purity after 2008—2009, 
corresponding to the beginning of the European sovereign 
debt crisis. This is broadly consistent with other litera¬ 
ture [? ? ] which suggests that financial assets become 
more correlated in times of crisis. However our analy¬ 
sis clearly shows that this does not affect all countries 
equally, since Great Britain, Germany, France and the 
Netherlands do not show such a sharp increase, which 
suggests companies in these countries are not clustering 
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FIG. 4: Figs, (a) and (b) show how the cluster purity scores for each country change over time after the influence of 
sector membership has been removed via the technique in Sec. |VI| The increased purity scores of the most crisis-hit 
countries after 2008-2009 are very prominent .Figs, (c) and (d) show the same purity scores computed on the data 
that has not had sector membership removed. The dynamic structure is clearly masked in these, and cannot be seen. 


together to a increased extent. Interestingly, companies 
in Sweden also displayed substantially higher clustering 
in the aftermath of the financial crisis. This may ini¬ 
tially seem surprising since this was one of the countries 
most unaffected by global instability. However, it is well 
known that in times of market uncertainty, investors pre¬ 
fer to invest in assets which are considered relatively safe 
in order to hedge against wider risk. This phenomena 
is known as the “flight to quality” [? ] and is a likely 
explanation of the increased correlation between Swedish 
assets, as risk-averse investors flocked to invest in these 
safe-havens. 


For reference, Figures 4c and 4c show the results of 
the same analysis carried out on the raw data which has 
not had the influence of sector membership removed as 
in Sec. VI In this case there are no obvious geographical 
patterns in the data, and the increase in clustering in 
the crisis-hit countries after 2008 cannot be seen. This 
reenforces the results previously shown in Table |IIb| that 
it is important to remove sector effects before studying 
geography; if this is not done then the country effects are 
invisible. 


VIII. CONCLUDING REMARKS 

In this work we have studied the effect that multiple 
correlation factors can have on the clustering structure of 
a financial portfolio. Unlike previous studies which have 
considered the effect of sector membership and geograph¬ 
ical location separately, we studied both simultaneously 
and have shown that the interaction between these fac¬ 
tors can be complex, and not easily handled by existing 
techniques for identifying overlapping communities. 

In order to reveal the time-varying dynamics of the 
clustering structure, we proposed a method for adjusting 
the observed correlation matrices to remove the effects of 
both sector membership and geography separately. After 
making this adjustment, the effect that both factors have 
on the clustering becomes much clearer. This allows the 
clustering structure to be studied over time. Our anal¬ 
ysis reveals that between the years of 2003 and 2008, 
companies located in the same country did not tend to 
cluster together. However after the 2008 financial crisis 
this pattern changed, and geography started to become 
an important determinant of the clustering. This is es¬ 
pecially the case in the countries which were hit hardest 
by the crisis, such as Spain and Italy. 























